ABSTRACT:



A speech recognizer system for use with a telecommunication network wherein an input signal 
generated onto the network from a first terminal is directed to a speech recognizer for estimating 
the verbal content of the input signal. The speech recognizer or associated equipment then 
directs an estimate of the verbal content as an output signal back to the first terminal, the 
estimate including one or more approximations of the verbal content of the input signal. At the 
first terminal the user then confirms a correct estimate, or selects from a plurality of 
approximations, the verbal content of the input signal. 



Europäisches Patentamt
European Patent Office
Office européen des brevets



(12) EUROPEAN PATENT APPLICATION

(21) Application number: 95302109.4    (51) Int. Cl.6: H04M 1/27, H04M 3/50

(22) Date of filing: 29.03.95

(30) Priority: 06.04.94 US 223810

(43) Date of publication of application: 11.10.95 Bulletin 95/41

(84) Designated Contracting States: AT BE CH DE DK ES FR GB GR IE IT LI LU MC NL PT SE

(71) Applicant: AT & T Corp.
32 Avenue of the Americas
New York, NY 10013-2412 (US)

(72) Inventor: Haimi-Cohen, Raziel
30 N. Derby Road, Springfield
New Jersey 07081 (US)
Inventor: Reed, Adam Victor
23 Longfellow Terrace, Morganville
New Jersey 07751 (US)

(74) Representative: Watts, Christopher Malcolm Kelway, Dr. et al
AT&T (UK) Ltd.
5, Mornington Road
Woodford Green Essex, IG8 0TU (GB)



(54) Speech recognition system with display for user's confirmation. 



Field of the Invention

The present invention relates to the field of voice or speech recognition and more
specifically to speech recognition in a network.

This invention also provides a method of error reduction in speech recognition systems
comprising the steps of providing an estimate of the verbal content of an input signal to
the user and receiving confirmation of a correct approximation from the user.



Background of the Invention 

Speech recognition is the process by which an 
audio input signal is received and the verbal content 
of the input signal is determined. The verbal content 
is then further processed to obtain the desired action. 
The verbal content can be a transcription of the 
speech of the input or merely a general statement of 
content. Additionally, the speech recognizer itself can 
be located at the user terminal, as in United States 
Patent No. 5,111,501 or in the network as in United 
States Patent No. 4,922,519. 

Because errors occur in the determination of verbal content, systems have been established
whereby the user is asked to repeat the request if the speech recognizer is unsure of the
content. In this regard, a study of customer interaction with speech recognizers is reported
in "Serving Customers With Automatic Speech Recognition - Human-Factors Issues" by
Wattenbarger et al., AT&T Tech. J., May/June 1993, pp. 28-41.

Summary of the Invention 

The present invention is directed to a speech recognizer system for use with a network
comprising a first terminal from which an input signal is generated onto the network, speech
recognizer means for estimating verbal content of the input signal, feedback means for
directing an output signal comprising one or more approximations to the first terminal, and
confirmation means including selection means at the first terminal for confirming the correct
approximation.

Preferably, the speech recognition means creates output signals which are digital to increase
transmission speed. It is also preferred that the first terminal have a visual display on
which to display the estimate, either one approximation at a time, all at once or a number
therebetween. Of course, an audio feedback of the estimate may be preferred in such situations
as a car phone where the user cannot easily and safely view a visual display or at a terminal
without a visual display.

The present invention further includes a method for speech recognition comprising the steps of
placing an input signal onto a network from a first terminal, estimating the verbal content of
the input signal on the network and transmitting the estimate back to the first terminal for
confirmation. Of course, if the speech recognizer is certain of the speech content from the
input signal, feedback of an estimate to the first terminal need not be performed.



Brief Description of the Drawings 

The figure is a schematic block diagram of the system of the present invention.

Description of the Preferred Embodiment 

In the preferred embodiment, the present invention is used with a telecommunications network 1
connected to a first terminal 2 from which a user generates an input signal 4 onto the
network 1. The input signal 4 is transmitted to a speech recognizer 6, also within the
network 1, where an estimate, including one or more approximations of the verbal content of
the input signal 4, is made. The estimate of the verbal content is converted to an output
signal 8 which is transmitted back to the first terminal 2 for user confirmation. Once the
correct verbal approximation is confirmed by the user at the first terminal 2, the information
is processed to complete the exchange, for example as described in United States Patent No.
4,922,519 to Daudelin.

In its preferred embodiment, the first terminal 2 is an automatic device or a user operated
device such as a telephone plugged into the network having a visual display 20 such as
Caller I.D. In practice, however, it is understood that any device producing a variable signal
on the network can be used.

Preferably, a user is able to speak into the microphone 22 of a telephone, for example, to
recite a desired number to be dialed, to recite whether a call is to be collect, charged to a
calling card, person-to-person, etc., to recite a person's name for which a number is
requested or to provide other information. The speech is generated onto the network as an
input signal 4, either in analog or digital format depending on the equipment making up the
first terminal line interface 26.

Within the network 1 is a speech recognizer 6 which receives and processes the input signal 4.
A suitable speech recognizer is a CONVERSANT CVIS, manufactured by AT&T.

If the speech recognizer 6 is sure of the verbal content of the input signal 4 (greater than a
predetermined percent certain, e.g. >90% certain), the speech recognizer 6 passes the
information to the switch 10 or other processing device as required. If, however, the speech
recognizer 6 is not sure of the verbal content of the input signal 4, it will provide an
estimate of the verbal content in an output signal 8, comprising one or more approximations of
the verbal content, to the first terminal 2 for confirmation of the estimate or selection of
an approximation.
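
The certainty test described above can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the names Approximation, CONFIDENCE_THRESHOLD and route
are hypothetical, and the 0-to-1 confidence scale is assumed; the description gives only the
">90% certain" example.

```python
from dataclasses import dataclass

# Assumed value, taken from the ">90% certain" example in the description.
CONFIDENCE_THRESHOLD = 0.90


@dataclass
class Approximation:
    text: str          # candidate verbal content, e.g. a recited number
    confidence: float  # recognizer certainty on an assumed 0..1 scale


def route(approximations, threshold=CONFIDENCE_THRESHOLD):
    """Decide where the recognizer result goes.

    Returns ("switch", best) when the best candidate exceeds the threshold,
    otherwise ("terminal", ranked) so the user can confirm or select.
    """
    ranked = sorted(approximations, key=lambda a: a.confidence, reverse=True)
    best = ranked[0]
    if best.confidence > threshold:
        return "switch", best
    return "terminal", ranked
```

In the second case the ranked list plays the role of the estimate carried by output signal 8,
with the most probable approximation first.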

Preferably, the first terminal 2 has a visual display 20 and the output signal 8 is in a
digital format when sent to the first terminal 2 to speed the movement of the output signal 8
on the network to the first terminal 2. Preferably the terminal will have a modem to decode
the digital signal for presentation. Alternatively, the output signal 8 will be in DTMF, which
can be used with almost all current terminal systems and which is decoded and presented in
visual form at the terminal. The display 20 of the terminal 2 will then decode the signal 8
and visually present it to the user.
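
As an illustration of the DTMF alternative, the sketch below maps keypad characters to the
standard DTMF low-group/high-group frequency pairs and back. It models only the symbol
mapping, not tone generation or detection, and the function names encode and decode are
hypothetical. The description does not say how several approximations would be delimited
within a DTMF output signal, so that is left out.

```python
# Standard DTMF frequency groups in Hz.
ROWS = (697, 770, 852, 941)       # low group
COLS = (1209, 1336, 1477, 1633)   # high group
KEYS = ("123A", "456B", "789C", "*0#D")

# Each key is signalled by one low-group and one high-group tone.
ENCODE = {
    key: (ROWS[r], COLS[c])
    for r, row in enumerate(KEYS)
    for c, key in enumerate(row)
}
DECODE = {pair: key for key, pair in ENCODE.items()}


def encode(text):
    """Turn a string of keypad characters into a list of frequency pairs."""
    return [ENCODE[ch] for ch in text]


def decode(pairs):
    """Recover the keypad characters that a terminal would display."""
    return "".join(DECODE[pair] for pair in pairs)
```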

At the first terminal 2 the output signal 8 is received and at least a portion thereof, i.e.
one approximation, is presented to the user for confirmation. In the preferred embodiment the
most probable approximation will be displayed first, followed by the next most probable if the
first is not selected or confirmed.

Of course, although a visual display 20 is preferred, the presentation can be audio via a
speaker 28, visual or both. Visual presentation, alone or in combination with audio, is
preferred in most instances because it is faster and demands less user alertness: the visual
information can stay on the display 20 until the user removes it by confirming the correctness
or selecting the next approximation.

However, for car phones where the user has his eyes on the road or with telephones that do not
have a visual display, audio presentation via speaker 28 is available alone or along with the
visual display 20.

In situations where an audio presentation is made, "barge-in" capability is especially
important so a user does not need to wait for the end of the audio presentation to confirm a
correct approximation or request the next selection. The barge-in feature allows the user to
make a confirmation or request another approximation, by depressing a key on the keypad 24 or
speaking into the microphone 22 during the presentation, thereby terminating the presentation
of the previous approximation without having to listen to the entire presentation.
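
A minimal sketch of barge-in, assuming the audio presentation can be treated as a sequence of
short chunks and that user input can be polled between chunks. Both assumptions, and the names
play_with_barge_in and poll_input, are illustrative only.

```python
def play_with_barge_in(chunks, poll_input):
    """Play audio chunks in order, stopping as soon as the user responds.

    poll_input is a callable returning None while the user is silent, or the
    user's response (a key press or recognized word) once there is one.
    Returns (number_of_chunks_played, response_or_None).
    """
    played = 0
    for chunk in chunks:
        response = poll_input()
        if response is not None:
            # Barge-in: abandon the rest of the presentation immediately.
            return played, response
        # A real terminal would send `chunk` to the speaker 28 here.
        played += 1
    return played, None
```

For example, if the user presses a key while the third chunk is pending, only two chunks are
played and the key is returned to the caller for handling.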

The preferred visual display 20 can be of any type, including a Caller I.D. where a line or
more of alphanumeric text is presented in an LCD display, a P.C. monitor, a CRT display, a
vacuum fluorescent display, an LED display, a video telephone, a still image telephone, etc.

In implementing the speech recognition system of the present invention, a communication
protocol must be defined for transmitting the output signal 8 from the network 1 to the first
terminal 2. Definition of the protocol requires that the variety of possible terminal types
and visual displays present in the network be taken into account. Several methods are
currently envisioned herein, including a bidirectional protocol, a terminal specific protocol
and a unidirectional protocol.



A bidirectional protocol requires that the terminal 2 respond to the network prompt and
describe the capabilities of the terminal 2. The network can then direct an output signal 8 to
the terminal 2 which matches the capabilities of the terminal 2. For example, if the line
interface 26 of the first terminal 2 has a high speed modem, the output 8 will be sent faster
using the modem protocol. If the terminal 2 is a videophone or still image telephone, the
system will generate an output signal 8 comprising a video image for transmission to the
visual display 20 of the terminal 2. If the terminal 2 can display more than one
approximation, the estimate may be transmitted by the network for visual display of more than
one approximation, and the user may be prompted by synthesized speech, etc., to choose. In the
bidirectional protocol a terminal which does not respond to the prompt will be considered to
not have any visual display 20 and the output 8 will be in the form of synthesized speech.
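
The capability matching of the bidirectional protocol might look like the sketch below. The
capability keys ("display", "video", "modem", "display_lines") and the format labels are
hypothetical, since the description does not define a message format. Falling back to DTMF for
a display-equipped terminal without a modem is also an assumption, based on the DTMF
alternative mentioned earlier.

```python
from typing import Optional


def choose_output_format(capabilities: Optional[dict]) -> dict:
    """Pick an output format for signal 8 from a terminal's reported capabilities.

    `capabilities` is None when the terminal did not answer the network prompt.
    """
    if capabilities is None or not capabilities.get("display", False):
        # No response (or no display reported): treat as audio-only terminal.
        return {"format": "synthesized_speech", "per_screen": 1}
    if capabilities.get("video"):
        fmt = "video_image"      # videophone or still image telephone
    elif capabilities.get("modem"):
        fmt = "modem_digital"    # faster transfer over the modem protocol
    else:
        fmt = "dtmf"             # assumed fallback for plain display terminals
    # Number of approximations the terminal can show at one time.
    per_screen = max(1, capabilities.get("display_lines", 1))
    return {"format": fmt, "per_screen": per_screen}
```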

With a terminal specific protocol, the network 1 stores a table of the identities of each
terminal 2 and utilizes a terminal specific protocol based on the information on the specific
terminal. This approach, however, would only be effective in a small network where a network
administrator has control over all of the installed terminals.
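
A terminal specific protocol reduces to a table lookup, as in this sketch. The terminal
identifiers, the profile fields and the audio-only default for unknown terminals are all
made-up illustrations; the description says only that the network stores a table of terminal
identities.

```python
# Hypothetical table maintained by the network administrator.
TERMINAL_TABLE = {
    "ext-101": {"format": "modem_digital", "per_screen": 4},
    "ext-102": {"format": "synthesized_speech", "per_screen": 1},
}

# Assumed fallback for a terminal missing from the table.
DEFAULT_PROFILE = {"format": "synthesized_speech", "per_screen": 1}


def profile_for(terminal_id, table=TERMINAL_TABLE):
    """Return the stored output profile for a terminal, or the default."""
    return table.get(terminal_id, DEFAULT_PROFILE)
```

The need to keep such a table accurate for every installed terminal is what limits this
approach to small, centrally administered networks.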

With a unidirectional protocol the network transmits both a digital feedback for visual
display and an audio synthesized speech feedback for audio presentation to the user at the
first terminal 2. The format is fixed and the specific terminal can ignore or display the
digital feedback for visual display. This is the simplest protocol; however, it does not allow
for customization to specific terminals.
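
The fixed-format idea can be sketched as a message that always carries both parts, with the
terminal deciding locally whether to show the digital part. The Feedback type and the function
names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    digital_text: str  # part a display-equipped terminal may show
    speech_text: str   # part rendered as synthesized speech


def build_feedback(approximation):
    """Network side: always send both parts, whatever the terminal is."""
    return Feedback(digital_text=approximation, speech_text=approximation)


def present(feedback, has_display):
    """Terminal side: play the audio part; show the digital part if possible."""
    outputs = [("speaker", feedback.speech_text)]
    if has_display:
        outputs.append(("display", feedback.digital_text))
    return outputs
```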

When the presentation is made to the user at the first terminal 2, the user is able to confirm
a correct estimate. This includes the ability to indicate that a correct estimate is displayed
or request another alternative if additional alternatives are available. If a multi-line
display, e.g. a CRT display, is used, the confirmation means includes selection means to
select from the approximations displayed, to scroll down or bring up a new screen of
additional approximations, etc. Such means includes a keypad 24 or a microphone 22 for voice
input.
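
For a multi-line display, selection and paging can be sketched as follows. The helper names
screens and select are hypothetical, and the approximations are assumed to arrive already
ranked from most to least probable.

```python
def screens(approximations, lines_per_screen):
    """Split ranked approximations into screenfuls for a multi-line display."""
    if lines_per_screen < 1:
        raise ValueError("lines_per_screen must be at least 1")
    return [
        approximations[i:i + lines_per_screen]
        for i in range(0, len(approximations), lines_per_screen)
    ]


def select(approximations, lines_per_screen, screen_index, line_index):
    """Return the approximation the user picked on a given screen and line."""
    return screens(approximations, lines_per_screen)[screen_index][line_index]
```

With five approximations and a two-line display, the user would see three screens, the last
holding a single entry.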

Additionally, the feedback can be augmented with other information resulting from the query of
a database with a recognized input or the most closely matching approximations of an input.
For example, in a telephone directory application, the response to an input signal may include
an estimate including the most closely matching name or names together with the corresponding
telephone numbers. Similarly, in an exchange request the cost of calling each of the
approximations can be included. In such applications the confirmation feature can include
automatic dialing of the selected approximation or a request for the next screenful of
approximations.
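
The directory case might be sketched as below, with a plain dictionary standing in for the
database. The names and numbers are invented for illustration, and augment is a hypothetical
helper.

```python
def augment(approximations, directory):
    """Pair each approximated name with its directory number, if any.

    `directory` maps names to telephone numbers; names missing from it are
    paired with None so the terminal can still list them.
    """
    return [(name, directory.get(name)) for name in approximations]


# Invented sample data, for illustration only.
SAMPLE_DIRECTORY = {
    "Alice Example": "555-0101",
    "Alison Example": "555-0102",
}
```

Confirming an entry then gives the network a number it can dial automatically, as the
paragraph above describes.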
Example

A user at a first terminal 2, having a visual display 20 comprising a Caller I.D. device,
associated with a network 1 which employs a speech recognition facility lifts a handset and
dictates a telephone number of another terminal 14 he wishes to be connected to into the
microphone 22. An input signal 4 is placed onto the network 1 and is directed to a speech
recognizer 6. The speech recognizer 6 estimates the verbal content of the input signal 4, i.e.
the telephone number recited by the user, producing an estimate of the verbal content of the
input 4 comprising three (3) approximations.

The estimate is coded in a digital format and is transmitted as an output signal 8 in a DTMF
format to the first terminal 2. The approximation considered most probable by the speech
recognizer is displayed on the Caller I.D. LCD display 20 and a speech synthesized voice
recites the number through the speaker 28. If the displayed approximation is the one the user
desires, the user depresses the "*" key on the keypad 24 to confirm the number, thereby
directing the network to attempt to reach the other terminal 14 associated with that number. A
barge-in feature stops the speech synthesized voice when the key is depressed.

If the first approximation displayed is not the correct number, the user depresses the "#" key
on the keypad 24 and the next most probable approximation appears. Again, a barge-in feature
stops the synthesized voice reciting the first approximation and begins reciting the next
approximation when the "#" key is depressed on the keypad 24. When the correct approximation
is displayed the user confirms the approximation by depressing the "*" key on the keypad 24
and the call to the other terminal 14 is placed on the network 1.
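
The key-driven exchange in this example can be sketched as a small loop over key presses. The
confirm key is assumed to be "*" and the next key "#"; the function name run_session, the
sample numbers and the choice to ignore other keys are illustrative assumptions.

```python
CONFIRM_KEY = "*"  # assumed confirm key
NEXT_KEY = "#"     # key requesting the next most probable approximation


def run_session(approximations, keys):
    """Walk the ranked approximations in response to key presses.

    Returns the confirmed approximation, or None if the keys run out or the
    user steps past the last approximation without confirming.
    """
    index = 0
    for key in keys:
        if index >= len(approximations):
            break  # nothing left to offer
        if key == CONFIRM_KEY:
            return approximations[index]
        if key == NEXT_KEY:
            index += 1
        # Any other key is ignored in this sketch.
    return None
```

With three approximations, pressing "#" once and then "*" confirms the second most probable
number.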

While the present invention has been described in detail with reference to specific
embodiments thereof, it will be apparent to those skilled in the art that various changes and
modifications can be made without departing from the scope of the invention.



Claims 






1. A speech recognizer system for use with a telecommunication network comprising first
terminal means from which a user generates an input signal onto the network, speech recognizer
means in the network for providing an estimate of the content of the input signal, feedback
means for directing an output signal of the estimate from the speech recognizer means to the
first terminal means, said estimate comprising more than one approximation of the content of
the input signal, and confirmation means at the first terminal means for confirming or
selecting a correct approximation of the content of the input signal.

2. The speech recognizer system of claim 1 wherein 
the output signal comprises a digital signal. 

3. The speech recognizer system of claim 2 wherein the first terminal means comprises a modem
and visual display means for presenting at least one of the more than one approximation of the
estimate of the content of the input signal to the user at one time.

4. The speech recognizer system of claim 3 wherein the visual display means is a caller I.D.
device, an LCD display, a CRT display, a P.C. monitor, a video terminal display, a video
telephone, a still image video display, a still image telephone, an LED display or a vacuum
fluorescent display.

5. The speech recognizer system of claim 1 wherein 
the output signal comprises synthesized speech 
in analogue form. 

6. The speech recognizer system of any of the preceding claims wherein the output signal
further comprises additional information related to the content of the input signal.

7. The speech recognizer system of claim 6 wherein the content of the input signal comprises
an identification of second terminal means and the output signal comprises telephone numbers
corresponding thereto.

8. A method of speech recognition in a telecommunication network comprising the steps of
estimating within the network the content of an input signal placed onto the network from
first terminal means, and transmitting an output comprising an estimate of the content of the
input signal back to the first terminal means for confirmation, said estimate comprising more
than one approximation of the input signal.

9. The method of claim 8 further comprising the 
step of digitizing the output signal prior to the 
transmission to the first terminal means. 

10. The method of claim 8 or claim 9 further comprising visually displaying one or more of the
approximations of the content of the input signal at the first terminal means.

11. The method of any of claims 8 to 10 wherein the approximations are presented sequentially
at the first terminal means with a most probable approximation presented first and a
successive next most probable alternative presented after said most probable approximation
only if the most probable approximation has not been confirmed.

12. A method of error reduction in speech recognition in a telecommunications network
comprising the steps of providing an estimate of the content of an input signal placed on the
network from a user at first terminal means back to the first terminal means, said estimate
comprising more than one approximation of the content of the input signal, and providing
confirmation or selection of a correct approximation from the first terminal means onto the
network.




EP 0 676 882 A2 




